This course aims to help you generate professional, publication-quality graphics!
The main package used here is ggplot2, an implementation of the Grammar of Graphics [how to write a graphic grammar sentence]. The topics covered are:
- Box-and-whisker plot
- Box-and-whisker plot overlaid with dot plot
- Violin plot
- Scatter plot
- Introducing faceting
- Line plot
- Error bars
- Histograms, histogram overlaid with density curve
- Density plot
Data loading:
Most probably, your data is in an Excel file. If so, the easiest way is to transfer your data from Excel to R through the clipboard.
To do that:
- Paste this code in the console of R
data <- read.table(file="clipboard",header=TRUE,sep="\t")
- Open your Excel sheet, highlight your data and press Ctrl+C to copy it
- Put your cursor on your code (step 1) and press Ctrl+Enter
- Your data should now be stored in R
- To confirm, click “Environment” in the upper right panel of RStudio; you should see your data there.
Note that file="clipboard" is supported on Windows; on macOS, use pipe("pbpaste") in its place.
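If the clipboard route is not available (for instance when a script runs non-interactively), the same tab-separated layout can be read from a text string or a file. A minimal sketch, assuming a tiny made-up table; the column names `sample` and `value` and the object name `demo` are hypothetical:

```r
# Tab-separated text, laid out the way a spreadsheet puts cells on the clipboard.
# The contents are made up for illustration.
txt <- "sample\tvalue\nA\t1.5\nB\t2.0\nC\t3.2"

# Same arguments as the clipboard call, but reading from a string
demo <- read.table(text = txt, header = TRUE, sep = "\t")

str(demo)  # 3 rows, two columns: sample and value
```

For data saved as a tab-delimited file, replacing `text = txt` with `file = "your_file.txt"` (a placeholder path) should behave the same way.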
Now we need to install and load the ggplot2 package. The package can be downloaded directly from the internet or installed from a zip/tar.gz file stored on your local drive.
From the internet: in RStudio click “Install”, select “Install from repository”, type the package name “ggplot2”, and make sure “Install dependencies” is checked.
From local disk: instead of selecting install from repository, select “Package Archive File” and then select your zip file.
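The same installation can also be done from the console. A minimal sketch, assuming an internet connection and a configured CRAN mirror; the names `pkgs` and `missing_pkgs` are only for illustration, and the install step is guarded so that it runs only in an interactive session:

```r
# Packages used in this tutorial
pkgs <- c("ggplot2", "reshape2", "RColorBrewer", "scales", "plyr", "dplyr")

# Keep only the ones that are not installed yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Install from the repository, together with their dependencies
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, dependencies = TRUE)
}
```

For a local archive, `install.packages()` also accepts a file path together with `repos = NULL`.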
Packages required
library(ggplot2)# the main player
library(reshape2)# reshaping and melting your data from wide to long
library(RColorBrewer)# color your graph in artistic way like leonardo da vinci :)
library(scales)# rescaling your axis
library(plyr)# data manipulation
library(dplyr)# data manipulation
Preparing the dataset
In this tutorial, I will use the iris dataset. This dataset is built into the datasets package that comes with R
head(iris)# checking the first 6 rows
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
str(iris)# check the dataset parameters
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)# have a quick look at the min, max, mean, etc.
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Let's make another dataset called iris1 without the Species column
iris1 <- iris[,1:4]# selecting only columns from 1-4 and name the dataset iris1
head(iris1)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
This table format, called wide format, is not appropriate for ggplot2; we need to change it into long format. To do that, we will use the melt function (reshape2 package)
iris2 <- melt(iris1)
## No id variables; using all as measure variables
head(iris2)
## variable value
## 1 Sepal.Length 5.1
## 2 Sepal.Length 4.9
## 3 Sepal.Length 4.7
## 4 Sepal.Length 4.6
## 5 Sepal.Length 5.0
## 6 Sepal.Length 5.4
tail(iris2)
## variable value
## 595 Petal.Width 2.5
## 596 Petal.Width 2.3
## 597 Petal.Width 1.9
## 598 Petal.Width 2.0
## 599 Petal.Width 2.3
## 600 Petal.Width 1.8
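Dropping the Species column is not the only option. A minimal sketch, assuming you want to keep the species for later grouping or faceting: passing it to melt as an id variable keeps one Species value next to every measurement. The name `iris_long` is only for illustration.

```r
library(reshape2)

# Keep Species as an identifier; melt the four numeric columns
iris_long <- melt(iris, id.vars = "Species")

head(iris_long, n = 3)
dim(iris_long)  # 150 flowers x 4 measurements = 600 rows, 3 columns
```

With this layout, Species is available as an extra aesthetic (for example `fill=Species`) without reshaping the data again.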
Let's draw the plot
ggplot(iris2, aes(x=variable, y=value)) +
geom_boxplot(notch=FALSE, width=0.5) +
theme_bw() +
theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
With notch
ggplot(iris2, aes(x=variable, y=value)) +
geom_boxplot(notch=TRUE, width=0.5) +
theme_bw() +
theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
To understand the box plot, look at the following figure
The box shows the interquartile range (IQR), which runs from the 25th to the 75th percentile, also known as Q1 and Q3. The IQR is where the central 50% of your data points fall.
The whiskers reach up to 1.5 times the IQR above the 75th percentile (Q3) and up to 1.5 times the IQR below the 25th percentile (Q1); in ggplot2 each whisker stops at the most extreme data point inside that range. For normally distributed data, this range covers about 99.3% of the data.
The line shows the median of the data.
The notch displays a confidence interval around the median, normally based on the median +/- 1.57 x IQR/sqrt(n) (ggplot2 uses 1.58 in place of 1.57).
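The quantities described above can be computed by hand. A minimal sketch in base R, assuming the Sepal.Width column of iris as example data and the 1.58 factor that ggplot2 uses for the notch; all object names are only for illustration:

```r
x <- iris$Sepal.Width

q   <- unname(quantile(x, c(0.25, 0.5, 0.75)))  # Q1, median, Q3
iqr <- q[3] - q[1]                               # height of the box

# Limits that the whiskers may not cross
lower_limit <- q[1] - 1.5 * iqr
upper_limit <- q[3] + 1.5 * iqr

# Each whisker stops at the most extreme data point inside the limits
whiskers <- c(min(x[x >= lower_limit]), max(x[x <= upper_limit]))

# Points beyond the limits are drawn individually as outliers
outliers <- x[x < lower_limit | x > upper_limit]

# Notch: approximate confidence interval around the median
notch <- q[2] + c(-1, 1) * 1.58 * iqr / sqrt(length(x))
```

Printing `q`, `whiskers`, `outliers` and `notch` next to the Sepal.Width box in the plot is a quick way to check how each number maps to a part of the drawing.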
OK, back to our code; let's tweak it more
ggplot(iris2, aes(x=variable, y=value, fill=variable)) +
geom_boxplot(notch=TRUE, width=0.5) +
theme_bw() +
theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
xlab("Flower part") +
ylab("measured value")
Let's read the iris3 dataset. This dataset is in long format but contains an additional column called “new”, which holds nominal values from 1 to 5.
iris3 <- read.csv("files/iris3.csv",header=TRUE)# load the dataset from a CSV file
str(iris3)# note that the new column is an integer
## 'data.frame': 600 obs. of 3 variables:
## $ variable: chr "Petal.Length" "Petal.Length" "Petal.Length" "Petal.Length" ...
## $ value : num 1 1 1.1 1.1 1.2 1.2 1.2 1.2 1.3 1.3 ...
## $ new : int 3 3 5 3 5 3 3 3 3 5 ...
unique(iris3$new)#calling unique values in new column
## [1] 3 5 1 2 4
#reset graphical device
dev.off()
## null device
## 1
ggplot(iris3, aes(x=variable, y=value,fill=as.factor(new))) +
geom_boxplot(outlier.colour=NA, width=.7,notch=T,fill="gray90")+
theme_bw() +
geom_dotplot(binaxis="y", binwidth=0.04, stackdir="center") +
theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
xlab("Flower part") +
ylab("Values") +
guides(fill=guide_legend(title="Rank of value"))
I can move the legend anywhere. Note the difference in the code (legend.position.inside requires ggplot2 3.5.0 or later)
ggplot(iris3, aes(x=variable, y=value,fill=as.factor(new))) +
geom_boxplot(outlier.colour=NA, width=.7,notch=T,fill="gray90") +
theme_bw() +
geom_dotplot(binaxis="y", binwidth=0.04, stackdir="center") +
theme(text = element_text(size=20, face="bold", colour="black"), axis.text.x=element_text(vjust=2)) +
xlab("Flower part") +
ylab("Values") +
theme(legend.position='inside', legend.position.inside = c(1, 1), legend.justification=c(1,1)) +
guides(fill=guide_legend(title="Rank of value"))
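On ggplot2 versions older than 3.5.0, the coordinates go straight into legend.position. A minimal sketch that picks the syntax from the installed version, assuming a simple box plot of iris as a stand-in for the graph above; the name `p` is only for illustration:

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  theme_bw()

if (packageVersion("ggplot2") >= "3.5.0") {
  # Newer syntax: legend.position = "inside" plus separate coordinates
  p <- p + theme(legend.position = "inside",
                 legend.position.inside = c(1, 1),
                 legend.justification = c(1, 1))
} else {
  # Older syntax: the coordinates are given directly to legend.position
  p <- p + theme(legend.position = c(1, 1),
                 legend.justification = c(1, 1))
}
p
```

The coordinates run from 0 to 1 within the plotting area, so c(1, 1) with the same justification pins the legend's top right corner to the top right corner of the panel.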
This graph adds a third variable to your plot: now you have x, y and a z that is represented by the color of the dots (here, the rank of the value)
The figure output should look like this
A violin plot is a method of plotting numeric data. It is a box plot with a rotated kernel density plot on each side.
Let's use the iris3 dataset.
ggplot(iris3, aes(x=variable, y=value)) +
geom_violin(fill="gray") +
geom_boxplot(width=0.2, fill="black", outlier.colour=NA) +
stat_summary(fun=median, geom="point", fill="white", shape=21, size=4) +
theme_bw() +
theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
#reset graphical device
dev.off()
## null device
## 1
diamonds is a dataset with the prices of about 54,000 round cut diamonds, built into the ggplot2 package. Let's see it
head(diamonds,n=10)# calling first 10 rows
## # A tibble: 10 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Let's make a scatter plot of carat against price
ggplot(diamonds, aes(x=carat,y=price)) +
geom_point(size=5,alpha=0.5) +
theme_bw() +
theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
Let's add another piece of information to the scatter plot: the color. I will plot each point with its designated color from the table
First, let's see what the diamond colors are
unique(diamonds$color)# calling unique values in color column
## [1] E I J H F G D
## Levels: D < E < F < G < H < I < J
Well, in diamonds there are 7 colors
ggplot(diamonds, aes(x=carat,y=price,fill=color)) +
geom_point(shape=21,size=5,alpha=0.5) +
theme_bw() +
theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
Note the change in the code: I added fill = color, and I also changed the shape of the points to 21, which can be filled with color. Here color is a discrete variable, but R can also color by continuous values; let's see how
Let's say we would like to color the points based on the table column (width of the top of the diamond relative to its widest point) in the diamonds dataset
ggplot(diamonds, aes(x=carat,y=price,fill=table)) +
geom_point(shape=21,size=5,alpha=0.5) +
theme_bw() +
theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
Note that the legend changed to a continuous color scale
A rug (geom_rug) can also be added to the margins of the graph to show the density of the points along each axis
ggplot(diamonds, aes(x=carat,y=price,fill=table)) +
geom_point(shape=21,size=5,alpha=0.5) +
geom_rug(position="jitter", linewidth=.01) +
theme_bw() +
theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
Well, so far so good ?!?!
In some circumstances we want to plot the relationship between a set of variables in multiple subsets of the data, with the results appearing as panels in a larger figure. This is known as a facet plot, a very useful feature of ggplot2. The faceting is defined by one or more categorical variables. Each panel corresponds to one value of the variable.
Let's go back to our plot of carat versus price filled by color, this one
ggplot(diamonds, aes(x=carat,y=price,fill=color)) +
geom_point(shape=21,size=5,alpha=0.5) +
theme_bw() +
theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
head (diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Note that this plot shows the whole carat versus price relationship with its color. The data are not separated by cut; in other words, you cannot get any information about the cut from this graph. Faceting splits the data into several plots based on another variable
Let's see; I will split this graph into several graphs based on the cut
ggplot(diamonds, aes(x=carat,y=price,fill=color)) +
geom_point(shape=21,size=5,alpha=0.5) +
theme_bw() +
theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
facet_wrap(~cut,nrow=2)
With geom rug
ggplot(diamonds, aes(x=carat,y=price,fill=color)) +
geom_rug(position="jitter", linewidth=.01) +
geom_point(shape=21,size=5,alpha=0.5) +
theme_bw() +
theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
facet_wrap(~cut,nrow=2)
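Faceting can also use two categorical variables at once. A minimal sketch with facet_grid, assuming a random subset of 2000 diamonds to keep the drawing quick; the seed and the names `small` and `p_grid` are arbitrary choices for illustration:

```r
library(ggplot2)

# A random subset keeps the example quick to draw (the seed is arbitrary)
set.seed(1)
small <- diamonds[sample(nrow(diamonds), 2000), ]

p_grid <- ggplot(small, aes(x = carat, y = price)) +
  geom_point(shape = 21, alpha = 0.5) +
  theme_bw() +
  facet_grid(cut ~ color)   # one row of panels per cut, one column per color
p_grid
```

facet_wrap lays the panels of one variable out in a ribbon, while facet_grid arranges them as a matrix of two variables, which makes it easier to compare along rows and columns.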
R has powerful color palettes; there is a package called “RColorBrewer”. Of course you can define your own palette; however, I am not discussing that right now
display.brewer.all()
Let's change these fancy colors to different ones. But before that, I will store the previous graph code under the name “graph1”
graph1 <- ggplot(diamonds,aes(x=carat,y=price,fill=color)) +
geom_rug(position="jitter", linewidth=.01) +
geom_point(shape=21,size=5,alpha=0.5) +
theme_bw() +
theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
facet_wrap(~cut,nrow=2)
Typing graph1 will generate the graph
graph1 + scale_fill_brewer(palette="Purples")# i will choose purple coloring, i like purple!
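To see which colors are behind a palette name, the palette can be pulled out as a plain vector. A minimal sketch, assuming 7 colors because the diamonds color column has 7 levels; the name `purples` is only for illustration:

```r
library(RColorBrewer)

# Seven shades of the "Purples" palette, from lightest to darkest
purples <- brewer.pal(7, "Purples")
purples   # a character vector of hex color codes
```

Such a vector can be handed to scale_fill_manual(values = purples) when you want to edit or reorder individual colors.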
We can see that colorless diamonds (“more expensive ;D”) are mostly found at smaller carats and rarely among the bigger diamonds. Makes sense!
Do you know why diamond is expensive? Its chemical formula is “C”, pure carbon
$30.6 sold in 2013 [the largest and most expensive diamond]
In a scatter plot, you can control the shape and size of the points based on a certain variable. Let's see.
I will plot carat vs price and change the point size based on the x column (length)
ggplot(diamonds, aes(x=carat,y=price,size=x)) +
geom_point(shape=21,alpha=0.5) +
theme_bw() +
theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
Note that I removed the point size from geom_point
Size works better for continuous variables
Or I can change the shape instead
ggplot(diamonds, aes(x=carat,y=price,shape=cut)) +
geom_point(size=3.5) +
theme_bw() +
theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
## Warning: Using shapes for an ordinal variable is not advised
I can change the y axis to a log scale. This requires a package called scales
graph2 <- ggplot(diamonds,aes(x=carat,y=price,fill=color)) +
geom_point(shape=21,size=5,alpha=0.5) +
theme_bw() +
theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
graph2 + scale_y_continuous(trans=log2_trans())# log 10 works also
coord_trans() instead transforms the coordinate system: the axis keeps the original values, with visually diminishing spacing
graph2 + coord_trans(y="log2")
This is another tweak to give the y axis a scientific look
graph2 +
scale_y_continuous(trans = log2_trans(),breaks = trans_breaks("log2", function(x) 2^x), labels = trans_format("log2", math_format(2^.x)))
Let's use the mtcars dataset (datasets package)
head(mtcars,n=10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Now, let's plot a line graph with mpg as x and disp as y
ggplot(mtcars,aes(x=mpg,y=disp)) +
geom_line() # simple base code
# note that the x data are continuous here
R understands here that your input data is numeric (continuous). In some cases you may need to tell R that your data is categorical (discrete). Let's see how, for the same graph
ggplot(mtcars,aes(x=factor(mpg),y=disp,group=1)) +
geom_line()
Can you see the difference?
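The difference comes from what factor() does to the values. A minimal sketch in base R; the name `mpg_factor` is only for illustration:

```r
# factor() turns each distinct mpg value into a label (a level)
mpg_factor <- factor(mtcars$mpg)

nlevels(mpg_factor)            # number of distinct mpg values
head(levels(mpg_factor))       # the levels, sorted from smallest to largest
head(as.integer(mpg_factor))   # position of each car's mpg among the levels
```

On a continuous axis the distance between two points reflects the difference in mpg, while on a discrete axis every level gets an equally spaced slot, whatever the gap between the underlying numbers.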
We can tweak the code by adding points
ggplot(mtcars,aes(x=factor(mpg),y=disp,group=1)) +
geom_line() +
geom_point()
In a line plot, we may need to plot different groups in different colors; let's do that
head(mtcars,n=3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Suppose we want to plot mpg and disp split by the gear column; we can do that
ggplot(mtcars,aes(x=mpg,y=disp,colour=factor(gear))) +
geom_line() +
geom_point()
# note that you need to tell R that gear is a factor so that each gear is plotted as its own line
As I showed earlier, you can control color and shape to bring in more variables. I will add the information from the am column by changing the shape of the points
ggplot(mtcars,aes(x=mpg,y=disp,colour=factor(gear),shape=factor(am))) +
geom_line() +
geom_point(size=4)
Control the line width in geom_line
ggplot(mtcars,aes(x=mpg,y=disp,colour=factor(gear),shape=factor(am))) +
geom_line(linewidth=1.5) +
geom_point(size=4)
In order to understand how to draw a graph with error bars, let's create a simple data frame
a=c("a","a","a","a","b","b","b","b","b","c","c","c","c","c","c")
b=c(1,2,3,4,5,6,4,4,1,2,3,4,5,6,7)
c=c(23,32,23,34,56,13,12,13,13,24,56,23,21,12,31)
d=c(23,43,54,54,56,67,65,34,15,67,87,65,43,46,45)
f=c("m","f","m","f","m","m","f","f","f","m","f","m","f","m","f")
data=data.frame(a=a,b=b,c=c,d=d,f=f)
data
## a b c d f
## 1 a 1 23 23 m
## 2 a 2 32 43 f
## 3 a 3 23 54 m
## 4 a 4 34 54 f
## 5 b 5 56 56 m
## 6 b 6 13 67 m
## 7 b 4 12 65 f
## 8 b 4 13 34 f
## 9 b 1 13 15 f
## 10 c 2 24 67 m
## 11 c 3 56 87 f
## 12 c 4 23 65 m
## 13 c 5 21 43 f
## 14 c 6 12 46 m
## 15 c 7 31 45 f
We need to summarize the data to get the mean, standard deviation and standard error. I will use the plyr package to get this information, then I will use it for the plot.
ddply function: split a data frame, apply a function, and return the results in a data frame
Please make sure you understand the code! I ask R to split the data based on a, then summarize the b column to generate the mean, median, standard deviation and standard error.
If the data include NA (missing values), R won't be able to generate these summaries (like the mean or SE). That's why I'm telling R to ignore NA values with na.rm=TRUE and to count only the non-missing values for n (“!is.na” means not NA)
Let's see; I will call the summarized data frame “sum”
sum <- ddply(data, c("a"),
summarise,
mb = mean(b, na.rm=TRUE),
medb=median(b, na.rm=TRUE),
sd = sd(b, na.rm=TRUE),
n = sum(!is.na(b)),
se = sd/sqrt(n))
Let's check it
sum
## a mb medb sd n se
## 1 a 2.5 2.5 1.290994 4 0.6454972
## 2 b 4.0 4.0 1.870829 5 0.8366600
## 3 c 4.5 4.5 1.870829 6 0.7637626
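The same summary can be written with dplyr, which is loaded above as well. A minimal sketch, assuming only the a and b columns of the data frame defined earlier; the name `sum_dplyr` is only for illustration:

```r
library(dplyr)

# The a and b columns of the example data frame
data <- data.frame(
  a = c("a","a","a","a","b","b","b","b","b","c","c","c","c","c","c"),
  b = c(1,2,3,4,5,6,4,4,1,2,3,4,5,6,7)
)

# Split by a, then compute one summary row per group
sum_dplyr <- data %>%
  dplyr::group_by(a) %>%
  dplyr::summarise(
    mb   = mean(b, na.rm = TRUE),
    medb = median(b, na.rm = TRUE),
    sd   = sd(b, na.rm = TRUE),
    n    = sum(!is.na(b)),
    se   = sd / sqrt(n)   # uses the sd and n columns created just above
  )

sum_dplyr
```

The dplyr:: prefix is used because plyr and dplyr both export a function called summarise, and which one is found depends on the order in which the packages were loaded.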
Now, let's draw the graph
ggplot(sum, aes(x=a,y=mb,fill=a)) +
geom_line(aes(group=1)) +
geom_point(shape=21, size=5) +
geom_errorbar(aes(ymin=mb-se,ymax=mb+se),width=.2) +
theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
Note that you need to tell R to connect the points with a line by adding aes(group=1) in geom_line
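ggplot2 can also compute the mean and standard error itself from the raw data, without a separate summary table. A minimal sketch using stat_summary with ggplot2's mean_se helper, assuming only the a and b columns of the data frame defined earlier; the name `p_se` is only for illustration:

```r
library(ggplot2)

# The a and b columns of the example data frame
data <- data.frame(
  a = c("a","a","a","a","b","b","b","b","b","c","c","c","c","c","c"),
  b = c(1,2,3,4,5,6,4,4,1,2,3,4,5,6,7)
)

p_se <- ggplot(data, aes(x = a, y = b)) +
  stat_summary(aes(group = 1), fun = mean, geom = "line") +         # line through the means
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) + # mean +/- 1 SE
  stat_summary(fun = mean, geom = "point", shape = 21, size = 5, fill = "white")
p_se

# mean_se returns the mean (y) and mean -/+ one standard error (ymin, ymax)
mean_se(c(1, 2, 3, 4))
```

The pre-computed summary table remains useful when you also want to report the numbers; stat_summary is a shortcut when only the picture is needed.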
Plotting multiple lines with error bars
In many cases, we need to draw several lines for different groups. In the next example I will summarize the data as before, but with one addition: I will ask R to group by 2 variables. The first variable represents the data points (as before); the second variable represents the group for the different lines. I will call the output sum1
sum1 <- ddply(data, c("a","f"),
summarise,
mb = mean(b, na.rm=TRUE),
medb=median(b, na.rm=TRUE),
sd = sd(b, na.rm=TRUE),
n = sum(!is.na(b)),
se = sd/sqrt(n))
Here is the graph. Note that I'm telling R to group by f, so each level of f gets its own line. Moreover, to avoid overlap between points at the same x position, I will tell R to shift each group by 0.3 with “position=position_dodge(.3)”.
ggplot(sum1, aes(x=a,y=mb,fill=a,group=f)) +
geom_line(position=position_dodge(.3)) +
geom_point(shape=21,size=5,position=position_dodge(.3)) +
geom_errorbar(aes(ymin=mb-se,ymax=mb+se),width=.2,position=position_dodge(.3)) +
theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
ylab("b column data")
Of course I can change the line pattern based on the group
ggplot(sum1, aes(x=a,y=mb,fill=a,group=f,linetype=f)) +
geom_line(position=position_dodge(.3)) +
geom_point(shape=21,size=5,position=position_dodge(.3)) +
geom_errorbar(aes(ymin=mb-se,ymax=mb+se),width=.2,position=position_dodge(.3)) +
theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
ylab("b column data")
Or the line color, line width, error bar width, etc. # play with the numbers
ggplot(sum1, aes(x=a,y=mb,fill=a,group=f,color=f)) +
geom_line(position=position_dodge(.3),linewidth=1.5) +
geom_point(shape=21,size=5,position=position_dodge(.3)) +
geom_errorbar(aes(ymin=mb-se,ymax=mb+se),linewidth=1.5,width=0.2,position=position_dodge(.3)) +
theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
ylab("b column data")
A histogram differs from a bar plot in one important respect. REMEMBER: a histogram always shows counts on the y axis.
Let's use the diamonds data
A histogram takes a column, divides its range into bins, counts the values that fall in each bin, and plots the counts.
Let's draw a histogram of the diamond prices
ggplot(diamonds,aes(x=price)) +
geom_histogram() # the base code
Understanding binwidth
When plotting a histogram, you need to know about binwidth, which is the width of the window in which numbers are counted
Assume you have these numbers (1, 1.3, 1.6, 3, 4, 5.4). Setting binwidth to 1, the counting windows are 1-2, 2-3, 3-4 and so on, where (2, 3] means greater than 2 and up to and including 3. The example then gives
| bin | count |
|---|---|
| [1, 2] | 3 |
| (2, 3] | 1 |
| (3, 4] | 1 |
| (4, 5] | 0 |
| (5, 6] | 1 |
Changing the binwidth from 1 to 2 gives different counts
| bin | count |
|---|---|
| [1, 3] | 4 |
| (3, 5] | 1 |
| (5, 7] | 1 |
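The two tables can be reproduced in base R with cut(), which assigns each number to a bin. A minimal sketch, assuming bins that start at 1 and are closed on the right; the names `counts1` and `counts2` are only for illustration:

```r
x <- c(1, 1.3, 1.6, 3, 4, 5.4)

# binwidth = 1: bin edges at 1, 2, 3, 4, 5, 6
counts1 <- table(cut(x, breaks = seq(1, 6, by = 1), include.lowest = TRUE))
counts1

# binwidth = 2: bin edges at 1, 3, 5, 7
counts2 <- table(cut(x, breaks = seq(1, 7, by = 2), include.lowest = TRUE))
counts2
```

geom_histogram places its bin edges by its own defaults, so its counts for the same binwidth can differ from this table unless the edges are pinned with the boundary or center argument.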
The overall pattern should stay the same. OK, let's change the binwidth and see [with some tweaks]
ggplot(diamonds,aes(x=price)) +
geom_histogram(binwidth=100,fill="white",color="black")
hist <- ggplot(diamonds,aes(x=price)) +
geom_histogram(binwidth=500,fill="gold",color="red")
In general, I don't like the grey background of ggplot, so I decided to make my own theme
mytheme <- theme_bw() +
theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
So I can apply it to any named graph
hist + mytheme
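To avoid adding the theme to every graph, it can be made the default for the session with theme_set(). A minimal sketch that repeats the mytheme definition so it runs on its own; the names `old_theme` and `p_hist` are only for illustration:

```r
library(ggplot2)

mytheme <- theme_bw() +
  theme(text = element_text(size = 20, face = "bold", colour = "black"),
        axis.text.x = element_text(vjust = 2))

# Make mytheme the default; the previous default is returned so it can be restored
old_theme <- theme_set(mytheme)

# This plot picks up mytheme without "+ mytheme"
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)
p_hist

# Restore the previous default theme
theme_set(old_theme)
```

Adding `+ mytheme` stays the better choice when only some graphs in a session should use the custom look.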
So far so good ?
Again, I can split the histogram output based on the “cut” in the diamonds data
head(diamonds,n=2)
## # A tibble: 2 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
ggplot(diamonds,aes(x=price,fill=factor(cut))) +
geom_histogram(color="black")
Or I can draw the graph as interleaved multiple histograms. Note that here I told R to do that by adding position="dodge". Also note that alpha controls transparency
ggplot(diamonds,aes(x=price,fill=factor(cut))) +
geom_histogram(position="dodge",alpha=0.4,color="black")
Data can also be separated into different panels. Do you remember how? Faceting
ggplot(diamonds,aes(x=price,fill=factor(cut))) +
geom_histogram(alpha=0.4,color="black") +
facet_wrap(~cut)
To draw a histogram overlaid with a density curve
ggplot(diamonds, aes(x=price, y=after_stat(density))) +
geom_histogram(fill="cornsilk",color="grey50",linewidth=.2) +
geom_density()
ggplot(diamonds, aes(x=price, y=after_stat(density),fill=color)) +
geom_histogram(color="grey50",linewidth=.2) +
geom_density(alpha=0.1) +
facet_wrap(~color)
End of session
The more you give, the more you get:)